A Rule Based Taxonomy of Dirty Data
Abstract
There is a growing awareness that high-quality data is key to today’s business success and that dirty data within data sources is one of the causes of poor data quality. To ensure high-quality data, enterprises need a process, methodologies and resources to monitor, analyze and maintain the quality of their data. Nevertheless, research shows that many enterprises do not pay adequate attention to the existence of dirty data and have not applied useful methodologies to ensure high-quality data for their applications. One of the reasons is a lack of appreciation of the types and extent of dirty data. In practice, detecting and cleaning all the dirty data that exists in all data sources is expensive and unrealistic, so most enterprises need to consider the cost of cleaning dirty data. This problem has not attracted enough attention from researchers. In this paper, a rule-based taxonomy of dirty data is developed. The proposed taxonomy not only provides a mechanism to deal with this problem but also covers more dirty data types than any existing taxonomy.
Keywords: Dirty data; data quality; data cleaning
Similar resources
The Doctrine of Abuse of Right in Literary and Artistic Property
The abuse of right doctrine is based on the Equity Rule and the doctrine of dirty hands. The dirty hands doctrine prohibits the owner of a right from receiving unfair loss because of incorrect operation based on bad faith. For example, the owner of a right who uses quasi-anticompetitive proceedings in the literary and artistic property domain can be prohibited from abusing his authority by applying this doctrine....
Qualitative Data Cleaning
Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and wrong business decisions. Data cleaning exercises often consist of two phases: error detection and error repairing. Error detection techniques can be either quantitative or qualitative, and error repairing is performed by applying data transformation script...
A Taxonomy of Dirty Time-Oriented Data
Data quality is a vital topic for business analytics in order to gain accurate insight and make correct decisions in many data-intensive industries. Albeit systematic approaches to categorize, detect, and avoid data quality problems exist, the special characteristics of time-oriented data are hardly considered. However, time is an important data dimension with distinct characteristics which aff...
Data Cleaning using Probabilistic Models of Integrity Constraints
In data cleaning, data quality rules provide a valuable tool for enforcing the correct application of semantics on a dataset. Traditional rule discovery techniques assume a reasonably clean dataset, and fail when faced with a dirty one. Enforcement of these rules for error detection is much less effective when mined on dirty data. In the databases literature, a popular and expressive type of lo...
Data Quality in Very Large, Multiple-Source, Secondary Datasets for Data Mining Applications
The data mining research community is increasingly addressing data quality issues, including problems of dirty data. Hand, Blunt, Kelly and Adams (2000) have identified high-level and low-level quality issues in data mining. Kim, Choi, Hong, Kim and Lee (2003) have compiled a useful, complete taxonomy of dirty data that provides a starting point for research in effective techniques and fast alg...